Normalization of Dutch User-Generated Content
نویسندگان
چکیده
This paper describes a phrase-based machine translation approach to normalize Dutch user-generated content (UGC). We compiled a corpus of three different social media genres (text messages, message board posts and tweets) to have a sample of this recent domain. We describe the various characteristics of this noisy text material and explain how it has been manually normalized using newly developed guidelines. For the automatic normalization task we focus on text messages, and find that a cascaded SMT system where a token-based module is followed by a translation at the character level gives the best word error rate reduction. After these initial experiments, we investigate the system’s robustness on the complete domain of UGC by testing it on the other two social media genres, and find that the cascaded approach performs best on these genres as well. To our knowledge, we deliver the first proof-of-concept system for Dutch UGC normalization, which can serve as a baseline for future work.
منابع مشابه
Towards Shared Datasets for Normalization Research
In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluating text normalization approaches. With the combination of text messages, message board posts and tweets, these datasets represent a variety of user generated content. All data was manually normalized to their standard form using newly-developed guidelines. We perform automatic lexical normalizatio...
متن کاملNamed Entity Normalization in User Generated Content
Named entity recognition is important for semantically oriented retrieval tasks, such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. In many of these tasks it is important to be able to accurately normalize the recognized entities, i.e., to map surface forms to unambiguous references to real world entities. Within the context of st...
متن کاملGathering and Generating Paraphrases from Twitter with Application to Normalization
We present a new and unique paraphrase resource, which contains meaningpreserving transformations between informal user-generated text. Sentential paraphrases are extracted from a comparable corpus of temporally and topically related messages on Twitter which often express semantically identical information through distinct surface forms. We demonstrate the utility of this new resource on the t...
متن کاملExploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization
Text normalization techniques based on rules, lexicons or supervised training requiring large corpora are not scalable nor domain interchangeable, and this makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese make use of such techniques. In this work we propose a technique based on distributed representation of words (or word embed...
متن کاملEntity Models for Trigger-Reaction Documents
We define the notion of an entity model for a special kind of document popular on the web: an article followed by a list of reactions on that article, usually by many authors, usually inverse chronologically ordered. We call these documents trigger-reactions pairs. The entity model describes which named entities (persons, organizations, locations, products, urls) are mentioned, their type, how ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2013